Efficient User-Level Thread Migration and Checkpointing on Windows NT Clusters
Abstract
ion of running on a single shared memory multiprocessor, Brazos supports message passing by implementing the MPI library [20]. Thread migration in the context of a distributed system involves the movement of a computation thread from one currently executing process to another running process. Thread migration has been previously proposed as a tool for load balancing and communication reduction in distributed shared memory systems [13, 23]. Our work extends the use of thread migration to fault tolerance and cluster management. Migration can be used to tolerate shutdowns due to scheduled maintenance or power loss by dynamically moving all computation threads and necessary data of the application to another available node, without restarting the application. Migration can also be used to add or remove multiprocessor nodes on the fly by relocating existing computation threads to the new nodes as appropriate. Finally, the runtime system or programmer may elect to migrate a thread to another node in cases where moving the thread to the data is a better option than moving the data to the thread.
Applications that run for a long time or that require high availability need a means of recovering from failures, while minimizing the runtime overhead required to ensure recoverability. Previous work in distributed fault tolerance schemes can be categorized as either transaction-based or checkpoint-based, although combinations of both have been used. Transaction-based recovery is similar to database recovery, in that the distributed system maintains a list of memory transactions or messages [5]. Single node failures can be tolerated by replaying the transactions related to the failed node. Checkpointing is used to save the state of a process. In case of a failure, the checkpoint files are applied and computation can proceed from the point of the last checkpoint [1, 22].
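The checkpoint-and-recover cycle described above can be illustrated with a minimal, single-process sketch. It is not the Brazos mechanism: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the `fail_at` parameter used to simulate a failure, and the use of `pickle` for serialization are all assumptions made for illustration.

```python
import os
import pickle

def save_checkpoint(path, state):
    # Write to a temporary file, then rename it over the old checkpoint,
    # so a crash during the write never leaves a half-written file behind.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    # Return the last saved state, or a fresh initial state if none exists.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0}

def run(path, n_steps, interval, fail_at=None):
    # Hypothetical long-running computation: sums the integers below n_steps,
    # checkpointing every `interval` steps. `fail_at` simulates a failure.
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` again with the same path after a simulated failure resumes from the last checkpoint, so only the steps taken since that checkpoint are repeated. This is the trade-off behind the choice of checkpoint interval: shorter intervals lose less work but write checkpoints more often.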
Systems that combine transactions and checkpoints attempt to minimize the amount of work lost due to failure as well as the space requirements for recovery data. Our implementation of checkpointing is distinguished in two ways. First, we minimize the amount of data saved during a checkpoint operation by leveraging some of the existing coherence-related information available in the Brazos runtime system. This reduces both the overhead required to create checkpoints and the time needed to recover from failures. Second, our checkpoint facility can be initiated either explicitly upon user request or implicitly using predetermined checkpointing intervals. Our results indicate that the facility, given an appropriate choice of checkpoint interval, exhibits low execution time overhead and fast recovery times. The rest of the paper is organized as follows. In Section 2 we describe the design and performance of the Brazos thread migration mechanism. Section 3 contains a similar analysis of the Brazos checkpointing mechanism. In Section 4, we describe how thread migration and checkpoints can be combined to perform several fault tolerance and cluster management functions. Related work is discussed in Section 5. We conclude and describe future research directions in Section 6.
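The first distinguishing point, saving only the data that needs saving, can be sketched as an incremental checkpoint over pages. The abstract does not say which coherence-related information Brazos uses, so this sketch assumes a per-page "modified since last checkpoint" record; the `PagedMemory` class and the `restore` function are hypothetical names, not part of Brazos.

```python
class PagedMemory:
    """Toy paged memory that records which pages were written (assumed model)."""

    def __init__(self, n_pages, page_size):
        self.page_size = page_size
        self.pages = [bytearray(page_size) for _ in range(n_pages)]
        self.dirty = set()  # pages modified since the last checkpoint

    def write(self, page, offset, data):
        if offset < 0 or offset + len(data) > self.page_size:
            raise ValueError("write crosses page boundary")
        self.pages[page][offset:offset + len(data)] = data
        self.dirty.add(page)

    def checkpoint(self):
        # Save only the pages modified since the previous checkpoint.
        delta = {p: bytes(self.pages[p]) for p in sorted(self.dirty)}
        self.dirty.clear()
        return delta

def restore(n_pages, page_size, deltas):
    # Rebuild memory by applying the incremental checkpoints in order.
    mem = PagedMemory(n_pages, page_size)
    for delta in deltas:
        for p, content in delta.items():
            mem.pages[p][:] = content
    return mem
```

Each checkpoint here is proportional to the number of pages written since the previous one rather than to total memory size, which is the general effect the abstract attributes to reusing coherence information.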
Similar resources
Illinois-Intel Multithreading Library: Multithreading Support for Intel Architecture Based Multiprocessor Systems
Powerful desktop multiprocessor systems based on the Intel Architecture (iA) offer a formidable alternative to traditional scientific/engineering workstations for commercial application developers at an attractive cost-performance ratio. However, the lack of adequate compiler and runtime library support for multithreading and parallel processing on Windows NT makes it difficult or impossible to...
Data Conversion for Process/Thread Migration and Checkpointing
Process/thread migration and checkpointing schemes support load balancing, load sharing and fault tolerance to improve application performance and system resource usage on workstation clusters. To enable these schemes to work in heterogeneous environments, we have developed an application-level migration and checkpointing package, MigThread, to abstract computation states at the language level ...
Nanothreads vs. Fibers for the Support of Fine Grain Parallelism on Windows NT/2000 Platforms
Support for parallel programming is essential for the efficient utilization of modern multiprocessor systems. This paper focuses on the implementation of multithreaded runtime libraries used for the fine-grain parallelization of applications on the Windows 2000 operating system. We implement and introduce two runtime libraries. The first one is based on standard Windows user-level f...
Coordinated Thread Scheduling for Workstation Clusters Under Windows NT
Coordinated thread scheduling is a critical factor in achieving good performance for tightly-coupled parallel jobs on workstation clusters. We are building a coordinated scheduling system that coexists with the Windows NT scheduler, provides coordinated scheduling, and can generalize to provide a wide range of resource abstractions. We describe the basic approach, called “demand-based ...
Publication date: 1999